<h1>1. Detection Evaluation</h1>
<p>This page describes the <i>detection evaluation metrics</i> used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple metrics described below. To obtain results on the COCO test set, for which ground-truth annotations are hidden, generated results must be <a href="#upload">uploaded</a> to the evaluation server. The exact same evaluation code, described below, is used to evaluate results on the test set.</p>

<h1>2. Metrics</h1>
<p>The following 12 metrics are used for characterizing the performance of an object detector on COCO:</p>
<div class="json fontMono">
  <div class="jsonreg"><b>Average Precision (AP):</b></div>
  <div class="jsonk">AP</div><div class="jsonv">% AP at IoU=.50:.05:.95 <b>(primary challenge metric)</b></div>
  <div class="jsonk">AP<sup>IoU=.50</sup></div><div class="jsonv">% AP at IoU=.50 (PASCAL VOC metric)</div>
  <div class="jsonk">AP<sup>IoU=.75</sup></div><div class="jsonv">% AP at IoU=.75 (strict metric)</div>
  <div class="jsonreg"><b>AP Across Scales:</b></div>
  <div class="jsonk">AP<sup>small</sup></div><div   class="jsonv">% AP for small objects: area &lt; 32<sup>2</sup></div>
  <div class="jsonk">AP<sup>medium</sup></div><div  class="jsonv">% AP for medium objects: 32<sup>2</sup> &lt; area &lt; 96<sup>2</sup></div>
  <div class="jsonk">AP<sup>large</sup></div><div   class="jsonv">% AP for large objects: area &gt; 96<sup>2</sup></div>
  <div class="jsonreg"><b>Average Recall (AR):</b></div>
  <div class="jsonk">AR<sup>max=1</sup></div><div   class="jsonv">% AR given 1 detection per image </div>
  <div class="jsonk">AR<sup>max=10</sup></div><div  class="jsonv">% AR given 10 detections per image</div>
  <div class="jsonk">AR<sup>max=100</sup></div><div class="jsonv">% AR given 100 detections per image</div>
  <div class="jsonreg"><b>AR Across Scales:</b></div>
  <div class="jsonk">AR<sup>small</sup></div><div   class="jsonv">% AR for small objects: area &lt; 32<sup>2</sup></div>
  <div class="jsonk">AR<sup>medium</sup></div><div  class="jsonv">% AR for medium objects: 32<sup>2</sup> &lt; area &lt; 96<sup>2</sup></div>
  <div class="jsonk">AR<sup>large</sup></div><div   class="jsonv">% AR for large objects: area &gt; 96<sup>2</sup></div>
</div><br/>
<ol class="fontSmall">
  <li>Unless otherwise specified, <i>AP and AR are averaged over multiple Intersection over Union (IoU) values</i>. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric AP<sup>IoU=.50</sup>). Averaging over IoUs rewards detectors with better localization.</li>
  <li>AP is averaged over all categories. Traditionally, this is called "mean average precision" (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.</li>
  <li>AP (averaged across all 10 IoU thresholds and all 80 categories) will determine the challenge winner. This should be considered the single most important metric when considering performance on COCO.</li>
  <li>In COCO, there are more small objects than large objects. Specifically: approximately 41% of objects are small (area &lt; 32<sup>2</sup>), 34% are medium (32<sup>2</sup> &lt; area &lt; 96<sup>2</sup>), and 24% are large (area &gt; 96<sup>2</sup>). Area is measured as the number of pixels in the segmentation mask.</li>
  <li>AR is the maximum recall given a fixed number of detections per image, averaged over categories and IoUs. AR is related to the metric of the same name used in <a href="http://arxiv.org/abs/1502.05082" target="_blank">proposal evaluation</a> but is computed on a per-category basis.</li>
  <li>All metrics are computed allowing for at most 100 top-scoring detections per image (across all categories).</li>
  <li>The evaluation metrics for detection with bounding boxes and segmentation masks are identical in all respects except for the IoU computation (which is performed over boxes or masks, respectively).</li>
</ol>

<h1>3. Evaluation Code</h1>
<p>Evaluation code is available on the <a href="https://github.com/cocodataset/cocoapi" target="_blank">COCO github</a>. Specifically, see either <a href="https://github.com/cocodataset/cocoapi/blob/master/MatlabAPI/CocoEval.m" target="_blank">CocoEval.m</a> or <a href="https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py" target="_blank">cocoeval.py</a> in the Matlab or Python code, respectively. Also see <span class="fontMono">evalDemo</span> in either the Matlab or Python code (<a href="https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb" target="_blank">demo</a>). Before running the evaluation code, please prepare your results in the format described on the <a href="#format-results">results format</a> page.</p>
<p>The evaluation parameters are as follows (defaults in brackets, in general no need to change):</p>
<div class="json">
  <div class="jsonreg">params{</div>
  <div class="jsonk">"imgIds"  </div><div class="jsonv">: [all] N img ids to use for evaluation</div>
  <div class="jsonk">"catIds"  </div><div class="jsonv">: [all] K cat ids to use for evaluation</div>
  <div class="jsonk">"iouThrs" </div><div class="jsonv">: [.5:.05:.95] T=10 IoU thresholds for evaluation</div>
  <div class="jsonk">"recThrs" </div><div class="jsonv">: [0:.01:1] R=101 recall thresholds for evaluation</div>
  <div class="jsonk">"areaRng" </div><div class="jsonv">: [all,small,medium,large] A=4 area ranges for evaluation</div>
  <div class="jsonk">"maxDets" </div><div class="jsonv">: [1 10 100] M=3 thresholds on max detections per image</div>
  <div class="jsonk">"useSegm" </div><div class="jsonv">: [1] if true evaluate against ground-truth segments</div>
  <div class="jsonk">"useCats" </div><div class="jsonv">: [1] if true use category labels for evaluation</div>
  <div class="jsonreg">}</div>
</div>
<p>Running the evaluation code via calls to <span class="fontMono">evaluate()</span> and <span class="fontMono">accumulate()</span> produces two data structures that measure detection quality. The two structs are <span class="fontMono">evalImgs</span> and <span class="fontMono">eval</span>, which measure quality per-image and aggregated across the entire dataset, respectively. The evalImgs struct has KxA entries, one per evaluation setting, while the eval struct combines this information into precision and recall arrays. Details for the two structs are below (see also <a href="https://github.com/cocodataset/cocoapi/blob/master/MatlabAPI/CocoEval.m" target="_blank">CocoEval.m</a> or <a href="https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py" target="_blank">cocoeval.py</a>):</p>
<div class="json">
  <div class="jsonreg">evalImgs[{</div>
  <div class="jsonk">"dtIds"     </div><div class="jsonv">: [1xD] id for each of the D detections (dt)</div>
  <div class="jsonk">"gtIds"     </div><div class="jsonv">: [1xG] id for each of the G ground truths (gt)</div>
  <div class="jsonk">"dtImgIds"  </div><div class="jsonv">: [1xD] image id for each dt</div>
  <div class="jsonk">"gtImgIds"  </div><div class="jsonv">: [1xG] image id for each gt</div>
  <div class="jsonk">"dtMatches" </div><div class="jsonv">: [TxD] matching gt id at each IoU or 0</div>
  <div class="jsonk">"gtMatches" </div><div class="jsonv">: [TxG] matching dt id at each IoU or 0</div>
  <div class="jsonk">"dtScores"  </div><div class="jsonv">: [1xD] confidence of each dt</div>
  <div class="jsonk">"dtIgnore"  </div><div class="jsonv">: [TxD] ignore flag for each dt at each IoU</div>
  <div class="jsonk">"gtIgnore"  </div><div class="jsonv">: [1xG] ignore flag for each gt</div>
  <div class="jsonreg">}]</div>
</div><br/>
<div class="json">
  <div class="jsonreg">eval{</div>
  <div class="jsonk">"params"    </div><div class="jsonv">: parameters used for evaluation</div>
  <div class="jsonk">"date"      </div><div class="jsonv">: date evaluation was performed</div>
  <div class="jsonk">"counts"    </div><div class="jsonv">: [T,R,K,A,M] parameter dimensions (see above)</div>
  <div class="jsonk">"precision" </div><div class="jsonv">: [TxRxKxAxM] precision for every evaluation setting</div>
  <div class="jsonk">"recall"    </div><div class="jsonv">: [TxKxAxM] max recall for every evaluation setting</div>
  <div class="jsonreg">}</div>
</div>
<p>Finally <span class="fontMono">summarize()</span> computes the 12 detection metrics defined earlier based on the <span class="fontMono">eval</span> struct.</p>

<h1>4. Analysis Code</h1>
<p>In addition to the evaluation code, we also provide a function <span class="fontMono">analyze()</span> for performing a detailed breakdown of false positives. This was inspired by <a href="http://web.engr.illinois.edu/~dhoiem/projects/detectionAnalysis/" target="_blank">Diagnosing Error in Object Detectors</a> by Derek Hoiem et al., but is quite different in implementation and details. The code generates plots like this:</p>
<p>
  <img src="images/detection-analysis-person.jpg" class="wide40" />
  <img src="images/detection-analysis-all.jpg" class="wide40" />
</p>
<p>Both plots show analysis of the <a href="http://arxiv.org/abs/1512.03385" target="_blank">ResNet</a> (bbox) detector from Kaiming He et al., winner of the <a href="#detection-2015">2015 Detection Challenge</a>. The first plot shows a breakdown of errors of ResNet for the person class; the second plot is an overall analysis of ResNet averaged over all categories.</p>
<p>Each plot is a series of precision recall curves where each PR curve is guaranteed to be strictly higher than the previous as the evaluation setting becomes more permissive. The curves are as follows:</p>
<ol class="fontSmall">
  <li><b>C75</b>: PR at IoU=.75 (AP at strict IoU), area under curve corresponds to AP<sup>IoU=.75</sup> metric.</li>
  <li><b>C50</b>: PR at IoU=.50 (AP at PASCAL IoU), area under curve corresponds to AP<sup>IoU=.50</sup> metric.</li>
  <li><b>Loc</b>: PR at IoU=.10 (localization errors ignored, but not duplicate detections). All remaining settings use IoU=.1.</li>
  <li><b>Sim</b>: PR after supercategory false positives (fps) are removed. Specifically, any matches to objects with a different class label but that belong to the same supercategory don't count as either a fp (or tp). Sim is computed by setting all objects in the same supercategory to have the same class label as the class in question and setting their ignore flag to 1. Note that person is a singleton supercategory so its Sim result is identical to Loc.</li>
  <li><b>Oth</b>: PR after all class confusions are removed. Similar to Sim, except now if a detection matches <i>any</i> other object it is no longer a fp (or tp). Oth is computed by setting all other objects to have the same class label as the class in question and setting their ignore flag to 1.</li>
  <li><b>BG</b>: PR after all background (and class confusion) fps are removed. For a single category, BG is a step function that is 1 until max recall is reached then drops to 0 (the curve is smoother after averaging across categories).</li>
  <li><b>FN</b>: PR after all remaining errors are removed (trivially AP=1).</li>
</ol>
<p>The area under each curve is shown in brackets in the legend. In the case of the ResNet detector, overall AP at IoU=.75 is .399 and perfect localization would increase AP to .682. Interesting, removing all class confusions (both within supercategory and across supercategories) would only raise AP slightly to .713. Removing background fp would bump performance to .870 AP and the rest of the errors are missing detections (although presumably if more detections were added this would also add lots of fps). In summary, ResNet's errors are dominated by imperfect localization and background confusions.</p>
<p>For a given detector, the code generates a total of 372 plots! There are 80 categories, 12 supercategories, and 1 overall result, for a total of 93 different settings, and the analysis is performed at 4 scales (all, small, medium, large, so 93*4=372 plots). The file naming is [supercategory]-[category]-[size].pdf for the 80*4 per-category results, overall-[supercategory]-[size].pdf for the 12*4 per supercategory results, and overall-all-[size].pdf for the 1*4 overall results. Of all the plots, typically the overall and supercategory results are of the most interest.</p>
<p><b>Note:</b> <span class="fontMono">analyze()</span> can take significant time to run, please be patient. As such, we typically do not run this code on the evaluation server; you must run the code locally using the validation set. Finally, currently <span class="fontMono">analyze()</span> is only part of the Matlab API; <i>Python code coming soon</i>.</p>
